Approximation bounds for sparse principal component analysis
Authors
Abstract
We produce approximation bounds on a semidefinite programming relaxation for sparse principal component analysis. These bounds control approximation ratios for tractable statistics in hypothesis testing problems where data points are sampled from Gaussian models with a single sparse leading component.

We study approximation bounds for a semidefinite relaxation of the sparse eigenvalue problem, written here in penalized form as

    max_{‖x‖_2 = 1} x^T Σ x − ρ Card(x)

in the variable x ∈ R^n, where Σ ∈ S^n is a symmetric matrix and ρ ≥ 0 a penalty controlling sparsity. Sparse eigenvalues appear in many applications in statistics and machine learning. Sparse eigenvectors are often used, for example, to improve the interpretability of principal component analysis, while sparse eigenvalues control recovery thresholds in compressed sensing [Candès and Tao, 2007]. Several convex relaxations and greedy algorithms have been developed to find approximate solutions (see d'Aspremont et al. [2007, 2008] and Journée et al. [2008], among others), but except in simple scenarios where ρ is small and the two leading eigenvalues of Σ are well separated, very little is known about the tightness of these approximation methods. Here, using randomization techniques based on [Ben-Tal and Nemirovski, 2002], we derive simple approximation bounds for the semidefinite relaxation derived in [d'Aspremont, Bach, and El Ghaoui, 2008]. We do not produce a constant approximation ratio, and our bounds depend on the optimum value of the semidefinite relaxation: the higher this value, the better the approximation. A similar behavior was observed by Zwick [1999] for the semidefinite relaxation of MAXCUT, who showed that the classical approximation ratio of Goemans and Williamson [1995] can be improved when the value of the cut is high enough. We then show that, in some applications, it is possible to bound a priori the optimum value of the semidefinite relaxation, and hence produce a lower bound on the approximation ratio.
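As an illustration (our own sketch, not code from the paper), the penalized sparse eigenvalue problem above can be solved exactly for tiny n by enumerating supports: for a fixed support S, the inner maximization over unit vectors is the leading eigenvalue of the principal submatrix Σ_S.

```python
import itertools
import numpy as np

def penalized_sparse_eig(Sigma, rho):
    """Exact value of max_{||x||_2 = 1} x^T Sigma x - rho * Card(x),
    computed by enumerating all supports S: for a fixed support the
    inner maximum is the leading eigenvalue of the principal submatrix
    Sigma_S.  Exponential in n -- purely an illustration of the
    combinatorial problem that the semidefinite relaxation approximates."""
    n = Sigma.shape[0]
    best = -np.inf
    for k in range(1, n + 1):
        for S in itertools.combinations(range(n), k):
            lam = np.linalg.eigvalsh(Sigma[np.ix_(S, S)])[-1]  # lambda_max(Sigma_S)
            best = max(best, lam - rho * k)
    return best

# e.g. for Sigma = diag(2, 1) and rho = 0.5 the optimum keeps a single
# coordinate: 2 - 0.5 = 1.5
```

The enumeration makes explicit the trade-off the penalty ρ encodes: enlarging the support can only increase the leading eigenvalue, but costs ρ per nonzero coordinate.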
In particular, following recent work by Amini and Wainwright [2009] and Berthet and Rigollet [2012], we focus on the problem of detecting the presence of a (significant) sparse principal component in a Gaussian model, hence testing the significance of eigenvalues isolated by sparse principal component analysis. More precisely, we apply our approximation results to the problem of discriminating between the two Gaussian models N(0, I_n) and N(0, I_n + θvv^T), where v ∈ R^n is a sparse vector with unit Euclidean norm and cardinality k. We use a convex relaxation of the sparse eigenvalue problem to produce a tractable statistic for this hypothesis testing problem and show that in a high-dimensional setting where the dimension n, the number of samples m and the cardinality k grow towards infinity proportionally, the detection threshold on θ remains finite. More broadly, in the spirit of smoothed analysis [Spielman and Teng, 2001], this shows that analyzing the performance of semidefinite relaxations on random problem instances is sometimes easier and provides a somewhat more realistic description of typical approximation ratios. Another classical example of this phenomenon is a MAXCUT-like problem arising in statistical physics, for which explicit (asymptotic) formulas can be derived for certain random instances, e.g. the Parisi formula [Mézard et al., 1987, Mézard and Montanari, 2009, Talagrand, 2010] for computing the ground state of spin glasses in the Sherrington-Kirkpatrick model. It thus seems that comparing the performance of convex relaxations on random problem …

Date: June 18, 2012. 2010 Mathematics Subject Classification: 62H25, 90C22, 90C27.
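A rough numerical sketch of the detection problem (the sizes below and the brute-force k-sparse eigenvalue statistic are our own stand-ins; the paper uses a tractable semidefinite relaxation instead): sample from both models and compare the largest k-sparse eigenvalue of the empirical covariance.

```python
import itertools
import numpy as np

def k_sparse_eig(Sigma, k):
    """Largest k-sparse eigenvalue: max over supports of size k of the
    leading eigenvalue of the principal submatrix.  Exhaustive search,
    standing in for the tractable SDP-based statistic."""
    n = Sigma.shape[0]
    return max(np.linalg.eigvalsh(Sigma[np.ix_(S, S)])[-1]
               for S in itertools.combinations(range(n), k))

rng = np.random.default_rng(0)
n, m, k, theta = 10, 500, 3, 2.0   # illustrative sizes, not from the paper

# sparse spike v: unit Euclidean norm, cardinality k
v = np.zeros(n)
v[:k] = 1.0 / np.sqrt(k)

X_null = rng.standard_normal((m, n))                         # N(0, I_n)
cov_alt = np.eye(n) + theta * np.outer(v, v)
X_alt = rng.multivariate_normal(np.zeros(n), cov_alt, m)     # N(0, I_n + theta v v^T)

stat_null = k_sparse_eig(X_null.T @ X_null / m, k)
stat_alt = k_sparse_eig(X_alt.T @ X_alt / m, k)
# under the alternative the statistic concentrates near 1 + theta, while
# under the null it stays near 1 plus a sampling fluctuation, so for a
# signal this strong the two hypotheses separate cleanly
```

With m large relative to k and θ of order one, thresholding such a statistic separates the two models; the paper's contribution is to control the analogous statistic computed from the semidefinite relaxation, which remains tractable in high dimensions.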
Similar papers
Sparse Structured Principal Component Analysis and Model Learning for Classification and Quality Detection of Rice Grains
In scientific and commercial fields associated with modern agriculture, the categorization of different rice types and the determination of their quality are very important. Various image processing algorithms have been applied in recent years to detect different agricultural products. The problem of rice classification and quality detection in this paper is presented based on model learning concepts includ...
A New IRIS Segmentation Method Based on Sparse Representation
Iris recognition is one of the most reliable methods for identification. In general, it consists of image acquisition, iris segmentation, feature extraction and matching. Among them, iris segmentation has an important role in the performance of any iris recognition system. Nonlinear eye movement, occlusion, and specular reflection are the main challenges for any iris segmentation method. In thi...
Sum-of-Squares Lower Bounds for Sparse PCA
This paper establishes a statistical versus computational trade-off for solving a basic high-dimensional machine learning problem via a basic convex relaxation method. Specifically, we consider the Sparse Principal Component Analysis (Sparse PCA) problem and the family of Sum-of-Squares (SoS, aka Lasserre/Parrilo) convex relaxations. It was well known that in large dimension p, a planted k-spa...
Sketching for Principal Component Regression
Principal component regression (PCR) is a useful method for regularizing linear regression. Although conceptually simple, straightforward implementations of PCR have high computational costs and so are inappropriate for learning with large-scale data. In this paper, we propose efficient algorithms for computing approximate PCR solutions that are, on one hand, high-quality approximations to the...
Journal: Math. Program.
Volume 148, Issue -
Pages: -
Published: 2014